Override `drop` for a custom sequence
Posted
by
Bruno Reis
on Stack Overflow
See other posts from Stack Overflow
or by Bruno Reis
Published on 2012-10-13T06:01:34Z
Indexed on
2012/10/13
9:37 UTC
Read the original article
Hit count: 159
In short: in Clojure, is there a way to redefine a function from the standard sequence API (which is not defined on any interface like ISeq, IndexedSeq, etc) on a custom sequence type I wrote?
1. Huge data files
I have big files in the following format:
- A long (8 bytes) containing the number
n
of entries n
entries, each one being composed of 3 longs (ie, 24 bytes)
2. Custom sequence
I want to have a sequence on these entries. Since I cannot usually hold all the data in memory at once, and I want fast sequential access on it, I wrote a class similar to the following:
(deftype DataSeq [id
^long cnt
^long i
cached-seq]
clojure.lang.IndexedSeq
(index [_] i)
(count [_] (- cnt i))
(seq [this] this)
(first [_] (first cached-seq))
(more [this] (if-let [s (next this)] s '()))
(next [_] (if (not= (inc i) cnt)
(if (next cached-seq)
(DataSeq. id cnt (inc i) (next cached-seq))
(DataSeq. id cnt (inc i)
(with-open [f (open-data-file id)]
; open a memory mapped byte array on the file
; seek to the exact position to begin reading
; decide on an optimal amount of data to read
; eagerly read and return that amount of data
))))))
The main idea is to read ahead a bunch of entries in a list and then consume from that list. Whenever the cache is completely consumed, if there are remaining entries, they are read from the file in a new cache list. Simple as that.
To create an instance of such a sequence, I use a very simple function like:
(defn ^DataSeq load-data [id]
(next (DataSeq. id (count-entries id) -1 [])))
; count-entries is a trivial "open file and read a long" memoized
As you can see, the format of the data allowed me to implement count
in very simply and efficiently.
3. drop
could be O(1)
In the same spirit, I'd like to reimplement drop
. The format of these data files allows me to reimplement drop
in O(1) (instead of the standard O(n)), as follows:
if dropping less then the remaining cached items, just drop the same amount from the cache and done;
if dropping more than
cnt
, then just return the empty list.otherwise, just figure out the position in the data file, jump right into that position, and read data from there.
My difficulty is that drop
is not implemented in the same way as count
, first
, seq
, etc. The latter functions call a similarly named static method in RT
which, in turn, calls my implementation above, while the former, drop
, does not check if the instance of the sequence it is being called on provides a custom implementation.
Obviously, I could provide a function named anything but drop
that does exactly what I want, but that would force other people (including my future self) to remember to use it instead of drop
every single time, which sucks.
So, the question is: is it possible to override the default behaviour of drop
?
4. A workaround (I dislike)
While writing this question, I've just figured out a possible workaround: make the reading even lazier. The custom sequence would just keep an index and postpone the reading operation, that would happen only when first
was called. The problem is that I'd need some mutable state: the first call to first
would cause some data to be read into a cache, all the subsequent calls would return data from this cache. There would be a similar logic on next
: if there's a cache, just next
it; otherwise, don't bother populating it -- it will be done when first
is called again.
This would avoid unnecessary disk reads. However, this is still less than optimal -- it is still O(n), and it could easily be O(1).
Anyways, I don't like this workaround, and my question is still open. Any thoughts?
Thanks.
© Stack Overflow or respective owner